Fix usage of PMIx_Fence_nb to conform with PMIx Standard#13628
Fix usage of PMIx_Fence_nb to conform with PMIx Standard#13628janjust merged 1 commit intoopen-mpi:mainfrom
Conversation
|
@janjust This is timing out in some test NVIDIA runs, but I have no idea why and no way to test it myself (my Mac doesn't support UCX). I can try to do it in some kind of container, but would need advice on how. Otherwise, I'd just consider this as a suggestion PR and close it. |
|
@rhc54 let me give it a try |
|
I'm guessing update pmix sha is intended with this PR |
|
I removed the changes to common_ucx.c and now it passes. So there is something odd going on down there, but I don't have the ability to test it. Perhaps what is needed is to set the active flag to "false" in the case of a singleton, or when the fence isn't going to issue a callback? Might need some experimenting. |
|
yeah it now passes, because the initial PR, you didn't set the fenced in case of atomic completion so when it came to tear down it was stuck in the if(!fenced) the _nb code worked fine. |
|
See the other comment thread on #13626 (comment) - there is a necessary change to fix what @hppritcha observed. Your tests don't seem to hit that use-case. I'm just not including it here. I updated PMIx here because it was changed to strictly conform to the Standard - which it wasn't before. Hence, the spml/ucx component will now fail with these changes when running as a singleton. |
|
Please remove the PMIx sha update and include in a separate PR. |
hppritcha
left a comment
There was a problem hiding this comment.
please remove the pmix sha removal from this pr and open a separate one to do that.
The PMIx_Fence_nb function can return PMIX_OPERATION_SUCCEEDED to indicate that the function was executed atomically and the callback function will therefore not be called. The PMIx Standard lists a few reasons why this can happen, but the point here was to fix usage to properly handle that possibility. Signed-off-by: Ralph Castain <rhc@pmix.org>
The PMIx_Fence_nb function can return PMIX_OPERATION_SUCCEEDED to indicate
that the function was executed atomically and the callback function will
therefore not be called. The PMIx Standard lists a few reasons why this
can happen, but the point here was to fix usage to properly handle
that possibility.